
Build tent #2089

Merged
staryxchen merged 6 commits into kvcache-ai:main from Dao007forever:build_tent
May 28, 2026

@Dao007forever Dao007forever commented May 13, 2026

Contributor

Description

Fixes needed to build and run TENT without USE_REDIS, plus NVLink transport robustness fixes.

  • fabric_allocator.cmake: Add POST_BUILD so the fabric allocator's custom build script runs after the target is built, not before.
  • tent/CMakeLists.txt: Gate tests/ behind BUILD_UNIT_TESTS so TENT can build with unit tests disabled.
  • tent/src/runtime/transfer_engine_impl.cpp: Drop the tent/metastore/redis.h include and replace the REDIS_*_DB_INDEX macros with local constexpr constants — the header isn't compiled when USE_REDIS=OFF, but the DB-index validation is still wanted.
  • tent/src/transport/nvlink/nvlink_transport.cpp:
    • Save/restore the current CUDA device around IPC operations (cudaIpcGetMemHandle, cuMemGetAddressRange) so they run on the device the buffer was allocated on, then restore the caller's device.
    • Detect driver-allocated (VMM / cuMemCreate) pointers via cuMemRetainAllocationHandle and skip CUDA IPC export for them, since cudaIpcGetMemHandle only supports cudaMalloc-backed memory.
    • Log a descriptive error (addr, base, device, CUDA error string) when cudaIpcGetMemHandle fails, instead of just propagating the macro failure.
W20260512 17:32:07.779717 276846674057088 transfer_engine_impl.cpp:684] InternalError: cudaIpcGetMemHandle(&handle, (void*)base_ptr): invalid argument
    Raised at /home/inf-daole/Mooncake-dao/mooncake-transfer-engine/tent/src/transport/nvlink/nvlink_transport.cpp:263
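The save/restore behaviour described for nvlink_transport.cpp can be sketched as an RAII guard. This is a minimal illustration under stated assumptions, not the code from this PR: `ScopedDevice`, `FakeDeviceApi`, and `runOnDevice` are hypothetical names, and the fake stands in for `cudaGetDevice` / `cudaSetDevice` so the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA runtime's current-device state. In real
// code getDevice/setDevice would forward to cudaGetDevice/cudaSetDevice and
// check their return codes.
struct FakeDeviceApi {
    int current = 0;
    int getDevice() const { return current; }
    void setDevice(int dev) { current = dev; }
};

// RAII guard: makes `target` the current device on construction and restores
// the caller's device on destruction, including on early return.
template <typename Api>
class ScopedDevice {
public:
    ScopedDevice(Api& api, int target) : api_(api), saved_(api.getDevice()) {
        assert(target >= 0);  // device ordinals are non-negative
        if (saved_ != target) api_.setDevice(target);
    }
    ~ScopedDevice() {
        if (api_.getDevice() != saved_) api_.setDevice(saved_);
    }
    ScopedDevice(const ScopedDevice&) = delete;
    ScopedDevice& operator=(const ScopedDevice&) = delete;

private:
    Api& api_;
    int saved_;
};

// Hypothetical helper: runs `op` (for example an IPC export) with the buffer's
// owning device current, and returns the device that was current during `op`.
template <typename Api, typename Op>
int runOnDevice(Api& api, int owner_device, Op op) {
    ScopedDevice<Api> guard(api, owner_device);
    op();
    return api.getDevice();
}
```

With a guard like this, a caller that was on device 2 and exports a buffer owned by device 5 is back on device 2 once `runOnDevice` returns, whether or not the export succeeded.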

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Built TENT locally with USE_REDIS=OFF, BUILD_UNIT_TESTS=OFF, USE_CUDA=ON, USE_MNNVL=ON via `scripts/build_local_cuda_tent.sh`. Exercised the NVLink transport against PyTorch-allocated tensors (caching allocator sub-allocations) and against driver-allocated VMM buffers to confirm both paths are handled.
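The sub-allocation case can be illustrated with a small sketch. The log in the description shows `cudaIpcGetMemHandle` being called with `base_ptr`, which suggests the handle is taken for the base of the enclosing allocation; a tensor carved out of a caching allocator block then sits at some offset from that base. The code below is an assumption about how that bookkeeping might look, not the PR's implementation: `AllocationRange`, `ExportTarget`, and `locateInAllocation` are hypothetical names, and the range is passed in rather than queried through `cuMemGetAddressRange` so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Base and size of the whole allocation that contains a pointer, in the form
// cuMemGetAddressRange reports them. Filled in by hand here (assumption).
struct AllocationRange {
    std::uintptr_t base;
    std::size_t size;
};

// What an exporter would need: the base to take the IPC handle for, plus the
// offset the importing side adds back to reach the buffer.
struct ExportTarget {
    std::uintptr_t base;
    std::size_t offset;
};

// Returns true and fills `out` when [addr, addr + length) lies entirely
// inside `range`; returns false otherwise and leaves `out` untouched.
bool locateInAllocation(std::uintptr_t addr, std::size_t length,
                        const AllocationRange& range, ExportTarget* out) {
    assert(out != nullptr);
    if (addr < range.base) return false;
    const std::size_t offset = static_cast<std::size_t>(addr - range.base);
    if (offset > range.size) return false;
    if (length > range.size - offset) return false;  // avoids overflow
    out->base = range.base;
    out->offset = offset;
    return true;
}
```

For example, a buffer at 0x1100 of length 0x100 inside an allocation starting at 0x1000 with size 0x400 resolves to base 0x1000 and offset 0x100, while a buffer that runs past the end of the allocation is rejected.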

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using `./scripts/code_format.sh` before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request enhances the build system and CUDA transport logic by introducing a local CUDA build script, making unit test compilation optional, and improving the NVLink transport layer with CUDA device context management and VMM allocation support. Feedback from the review focuses on improving script portability by removing hardcoded user paths, addressing a security vulnerability in LD_LIBRARY_PATH construction, and ensuring robust error handling for CUDA driver API calls.

Comment thread scripts/build_local_cuda_tent.sh Outdated
Comment thread scripts/build_wheel.sh Outdated
Comment thread scripts/build_wheel.sh Outdated
Comment on lines +45 to +49
namespace {
constexpr uint8_t kRedisMaxDbIndex = 255;
constexpr uint8_t kRedisDefaultDbIndex = 0;
}

Collaborator


Why add these constants instead of reusing REDIS_DEFAULT_DB_INDEX from elsewhere?

Contributor Author


We are building without USE_REDIS, so the header that defines that macro isn't compiled.
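As an illustration of how file-local constants can keep the DB-index validation available when the Redis header is not compiled, here is a minimal sketch. The two constants mirror the snippet under review; `normalizeRedisDbIndex` and its fall-back-to-default behaviour are hypothetical and only show one possible policy, not what the PR does.

```cpp
#include <cstdint>

namespace {
// Local copies of the limits, so this file does not depend on a header that
// is only built when USE_REDIS is enabled.
constexpr std::uint8_t kRedisMaxDbIndex = 255;
constexpr std::uint8_t kRedisDefaultDbIndex = 0;

// Hypothetical helper: accepts an index read from configuration and falls
// back to the default when it is out of range (assumed policy).
constexpr std::uint8_t normalizeRedisDbIndex(int requested) {
    return (requested >= 0 && requested <= kRedisMaxDbIndex)
               ? static_cast<std::uint8_t>(requested)
               : kRedisDefaultDbIndex;
}

static_assert(normalizeRedisDbIndex(300) == kRedisDefaultDbIndex,
              "out-of-range index falls back to the default");
}  // namespace
```

Because the helper is `constexpr`, the range check can also be exercised at compile time, as the `static_assert` shows.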

Comment thread scripts/build_local_cuda_tent.sh Outdated

@staryxchen staryxchen left a comment

Collaborator


LGTM but need to check CI status

@Dao007forever

Contributor Author

CI failing with /usr/bin/ld: final link failed: No space left on device, is the node full?

@staryxchen

Collaborator

> CI failing with /usr/bin/ld: final link failed: No space left on device, is the node full?

You can push an empty commit to trigger CI again

@codecov-commenter


⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@staryxchen staryxchen merged commit 4569ce7 into kvcache-ai:main May 28, 2026
20 checks passed
A-Liuhao pushed a commit to A-Liuhao/Mooncake that referenced this pull request Jun 25, 2026
* Build with TENT

* Fix TENT failed start

* Revert

* Format

* Empty
4 participants